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I. INTRODUCTION 



Computational approaches to protein folding are divided into two main categories. In energy minimization 
methods the native state is identified with the the ground state of a suitable energy function (Brooks, 1998). 
In fold recognition methods, the native state is selected as the most compatible structure among those present 
in a library (Bowie et ai, 1992; Jones et aL, 1992; Fisher et ai, 1996). Both approaches depend on three 
important choices: 1) The representation of protein structure; 2) The set of alternative structures among 
which the native fold is sought for; 3) A bias towards the "best" conformation. In energy minimization 
approaches such a bias is an energy function, and the best conformation is the one of lowest energy. On the 
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other hand, in fold recognitfon methods, a compatibihty function for a sequence on a structure is used. The 

compatibility is often expressed in terms of database-derived properties and restraints. 

In this review we analyze the attempt to use contact maps to efficiently perform protein fold prediction. 
Contact maps are a particularly manageable representation of protein structure which has been already 
applied in the past to the study of protein conformation (Chan and Dill. 1990), structure comparison (Holm 
and Sander, 1993), interaction patterns in proteins (Lifson and Sander, 1979; Godzik et ai, 1993) and 
correlated mutations (Olmea and Valencia, 1997; Ortiz et al, 1998). The possibility of performing energy 
minimization in the space of contact maps has been proposed by Mirny and Domany (1996). We present 
here the consistent development of their idea, discussing successes, failures and perspectives. 

II. STRUCTURE REPRESENTATION 

We discuss the problem of representations of protein structure and give the definition of contact 
maps. 

Following Anfinsen's thermodynamic hypothesis (Anfinsen, 1973), the native state of a protein is commonly 
assumed to be the minimum of a free energy function. This is a powerful assumption and Molecular Dynamics 
is the most direct method to implement it (Brooks, 1998). The structure is represented by listing the 
coordinates of all the atoms and Newton's equation of motion are solved in a suitable force field tuned 
for molecular systems. Unfortunately, present computers cannot follow the trajectory of a protein all the 
way down to its native state. The best result to date is the simulation of 1 /zs trajectory of villin headpiece 
subdomain in water which allowed the detection of the hydrophobic collapse and of the formation of secondary 
structures (Duan and KoUman, 1998). Furthermore, the method depends crucially on the determination of 
suitable energy parameters. An incorrect energy function, which does not assign the lowest energy to the 
native conformation, leads a careful energy minimization procedure to some misfoldcd conformation, as for 
example shown by Karplus and collaborators in the case of hemerythrin and the immunoglobulin VL domain 
(Novotny et at, 1984). 

We believe that perhaps one can solve the problem without going to such microscopic detail and we set 
about to investigate this assumption. The problem of structure representation is to find the best trade-off 
between computability and accuracy of the predictions. Our inclination is that good predictions can be 
obtained by constructing simplified models. 

Lattice models offer the possibility to gradually turning on the complexity of the representation of the 
structure. Usually a protein is represented as a chain of monomers occupying lattice sites and representing 
Ca atoms. The complexity can be measured by the number of states available to each monomer (Park and 
Levitt, 1995). The lowest possible complexity is that of tetrahedral and simple cubic lattices, where 3 and 5 
states per monomer arc respectively available. Lattices of high coordination number, up to 55 states, have 
been studied (Ortiz et al, 1998). The main motivation to study lattice models is that at low complexity it is 
possible to effectively solve the problem of searching the ground state, either exactly, by enumerating all the 
conformations or approximately, by Monte Carlo methods. By using Monte Carlo simulations, solutions can 
be routinely obtained for polymers up to length 125 (Sali et al., 1994; Dinner et al, 1996). Using pairwise 
contact energy functions, evidence for important features of the folding process has been produced, most 
notably, the hydrophobic collapse. It is possible to consider more detailed lattice models which still retain 
some of the advantage in computability and allow a more realistic representation of protein structure: for 
example, it is possible to represent side chains by an additional virtual atom. Advances in this direction 
have been recently reported by Skolnick and collaborators (Ortiz et al., 1998). Interestingly, the accuracy 
increases very slowly with the complexity, and the typical resolution of structures in the Protein Data Bank 
(PDB) (Bernstein et al., 1977) can be obtained with models with 10 to 20 states (Park and Levitt, 1995). 
The Ca model can be simulated also off-lattice. For short chains and simple interactions it is possible to 
identify the ground state with reasonable reliability (Clementi et al., 1998; Irback and Potthast, 1995). 

A minimalistic representation of a protein's structure is given by its contact map (Lifson and Sander, 1979; 
Chan and Dill, 1990; Godzik et al., 1993; Hold and Sander, 1993; Mirny and Domany, 1996; Vendruscolo et 
al., 1997). The contact map of a protein with N residues is an N x N matrix S, whose elements are defined 
as 
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if residues i and j are in contact 
otherwise 



(1) 



One can define contact between two residues in different ways; one is to consider two amino acids in contact 
when their two Ca atoms are closer than some threshold Rc {''Ca' definition) (Vendruscolo et al., 1997). 
Another definition is based on the minimal distance between two atoms that belong to the two residues 
(ffinds and Levitt, 1994; Mirny and Domany, 1996) ("all atoms" definition). In a contact map a-helices 
appear as thick bands of contacts along the diagonal, /3-sheets as bands running parallel or perpendicular 
to the diagonal. Given all the inter-residue contacts or even a subset of them, it is possible to reconstruct 
quite well a protein's structure, by means of either Distance Geometry (Crippen and Havel, 1988), Molecular 
Dynamics (Briinger et al., 1986) or Monte Carlo (Vendruscolo et al., 1997). 

In contrast to Cartesian coordinates, the map representation of protein structure is independent of the 
coordinate frame. This property made contact maps attractive for protein structure comparisons and for 
searching a limited database for similar structures (Chan and Dill, 1990; Godzik et al, 1993; Hold and 
Sander, 1993). 

One of the main reasons for selecting the contact map representation of structure is our expectation and 
hope that we may be able to search the space of contact maps in an efficient manner and find low energy maps 
(that hopefully correspond to conformations close to the native one). In particular, one hopes that relatively 
simple changes on a map may generate very substantial coherent moves of the corresponding polypeptide 
chain conformation; moves which would have taken much longer to achieve by working with the chain itself. 

In order to actually implement such moves in the space of contact maps, one has to overcome two important 
problems. The main and foremost one is the need to ensure that the map S that has been generated is 
physical. Our definition of "physical" will be given in detail below; broadly speaking, we mean that there 
exists a chain conformation whose contact map is indeed our proposed S. Arbitrary changes performed on 
a map yield, with very high probability, non-physical maps. The reason is that the total number of possible 
N X N contact maps is 0(2'"^ ), whereas the number of physical maps is much smaller, of order 0(2''''^). 
The need for a procedure that limits the search to the small subspace of physical maps was identified by 
Mirny and Domany (1996), who proposed a restricted set of moves. The hope was that when moves selected 
from this set are performed on a physical map, the new map is also physical. These rules were, however, 
heuristic and no clear proof of the validity of this hope could be given. 

Subsequently, the problem of generating physical maps was dealt with by means of a reconstruction 
procedure (Vendruscolo et al, 1997). For any proposed contact map S one generates a chain, with S serving 
as the guide of the construction process. The procedure stops when the contact map S' of the resulting chain 
is close to the target map; this way one generates a map which is physical and close to the target. 

The second important issue is the manner in which low energy maps are generated from an existing one 
(Vendruscolo and Domany, 1998a). The procedure has to be such that the resulting map is "protein-like" , i.e. 
has secondary structure elements and the corresponding chain has the right density, bond-angle distribution, 
chirality, etc. The final map should be physical and the decorrelation time with the starting map should be 
short. 

In what follows we sketch how these two problems were addressed; for a more detailed description of these 
procedures the reader is referred to the original publications (Vendruscolo et al., 1997; Vendruscolo and 
Domany, 1998a). 



We present a method to obtain a three dimensional polypeptide conformation from a contact map. 
We also explain how to deal with the case of non physical contact maps. 

The aim is to find an efficient procedure, which can be performed "on line" and in parallel with the 
dynamics in the space of contact maps, which will "project" any map onto a nearby one which is guaranteed 
to be physically realizable. The protein is represented as a "string of beads" , in which each bead stands 
for an amino acid - the coordinates of center of a bead are identified with those of the corresponding Ca 
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atom. For a given target contact map S, the algorithm searches for a conformation that this string of 
beads can take, such that the contact map S' of our string is similar (or close) to S. If there exists a chain 
conformation whose contact map is identical with S, this contact map is, by definition, physical. In general, 
our method aims at converting a possibly ill-defined, non-physical set of contacts to a legitimate one. The 
three dimensional structure is in our case a means, rather than the end. 

It should be noted that related, previously developed methods (Havel et at, 1979; Briinger et at, 1986; 
Bohr et ai, 1993; Nilges, 1995; Lund et ai, 1996; Aszodi and Taylor, 1996; Mumenthaler and Brown, 1996; 
Skolnick et al, 1997) a different aim; to construct three dimensional structures from measured distance 
information, using various forms of distance geometry (Crippen and Havel, 1988), supplemented by restricted 
Molecular Dynamics (Scheek et ai, 1989) or simulated annealing (Briinger et ai, 1997). 

One should emphasize here the distinction between a contact map and a distance map. In a contact map 
a minimal amount of information is available - given a pair of amino acids, we know only if they are in 
contact or not. That is, only lower and upper bounds on their separation are given. A distance matrix, on 
the other hand, presents real-valued distances between pairs of amino acids. The method presented here 
is not restricted to contact maps and has been generalized to distance maps (Vendruscolo and Domany, 
unpublished). The deviations between different structures that were reconstructed from the same contact 
map are typically much higher than those between two structures derived from a distance matrix. 

The proposed algorithm is divided into two parts. The first part, growth, consists of adding one monomer 
at a time, i.e. a step by step growth of the chain. The second part, adaptation, is a refinement of the 
structure, obtained as a result of the growth stage, by local moves. In both stages, to bias the dynamics, we 
introduce cost functions defined on the basis of the contact map. Such cost functions contain only geometric 
constraints, and do not resemble the true energetics of the polypeptide chain. 



A. Growth 



The first element of the growth is single monomer addition., which is carried out in the spirit of the 
Rosenbluth method (Rosenbluth and Rosenbluth, 1955). To add monomer i to the growing chain we generate 
at random Nt trial positions, (typically Nt = 10), 

rp' =r,_i+r(^) , (2) 

where j = 1, . . . ,Nt. The length and the directions of r'^-'^ are set from a statistical analysis of PDB. One 
out of the Nt trials is chosen according to the probability 

— r'^' /t 

where Eg is a cost function which rewards contacts that should be present, according to the given contact 
map, and discourage contacts that should not be there. 

The second element of the procedure is chain growth. The step by step growth just discussed optimizes 
the position of successive amino acids along the sequence. The main difficulty in the method is that the 
single step of the growing chain has no information on the contacts that should be realized many steps (or 
monomers) ahead. To solve this problem, we carry out several attempts (typically 10) to reconstruct the 
structure, choosing the best one. In practice, this is done as follows. 

For each attempt, when position rl-'*^'-'! is chosen for monomer i according to Eq. (|^), its probability is 
accumulated in the weight 



w^.^n^'P^^ (4) 

k=l 



When we have reached the end of the chain we store the weight Wn- The trial chain with the highest Wn 
is chosen. 
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The cost Junction for growth is 

i-l 

k=l 

where r'"^ — jrp'' — r^l. The enhancing factor d = i — k \s introduced to guide the growth towards contacts 
that are long ranged along the chain; d is the Heaviside step function and the constant a.g can take two 
values; ag{Sik = 0) > and a,g{Sik = 1) < 0. That is, when a contact is identified in the chain, i.e. r^^ < dt, 
it is either rewarded (when the target map has a contact between i and k), or penalized. 



B. Adaptation 

When we have grown the entire chain of N points, we refine the structure according to the following 
scheme. We choose a point i at random and try, using a crankshaft move (Sali et ai, 1994), to displace it to 
rj, keeping fixed the distances from both points i — 1 and i + 1. We use a local cost function Ea^ : 

N 

fe=l 

where r[f. = |r- — rfe|. Note that the enhancing factor d has been omitted, so that fa does not favor contacts 
between monomers that are distant along the chain. The displacement is accepted with probability tt, 
according to the standard Metropolis prescription 

TT^ mm{l,cxp{- AEa/Ta) (7) 

where A.Ea is the change in the cost function Ea induced by the move and and Ta is a temperature-like 
parameter, used to control the acceptance ratio of the adaptation scheme. 

A key ingredient of our method is annealing (Kirkpatrick et al., 1983). As in all annealing procedures, 
the temperature-like parameter Ta is decreased slowly during the simulation to help the system find the 
ground state in a rugged energy landscape. In our method, however, instead of using simulation time as 
a control parameter on the temperature, we chose the number n of missing contacts. Two regimes were 
roughly distinguished. In the first regime many contacts are missed and the map is very different from the 
target one. In the second regime few contacts are missed, and the map is close to the target. The parameters 
a,a and Ta are interpolated smoothly between values suitable for these two limiting cases. In the first regime, 
we strongly favor the recovery of contacts that should be realized, whereas in the second regime we strongly 
disfavor contacts that are realized but should not be present. We set, as shown in Fig. ^ 

ai")(5) = a/(5) + [a' (5) - a/(5)]a(n) , (8) 

and 

Tin) ^ Tf ^ (J.. „ _ (9) 

The function (T{n) interpolates between the initial value a* and the final value a-'', 

^(") = TT^^^ - 1 • (10) 

1 + e 9 

By choosing a% a^, T^, TI and ag we define the two regimes, far from and close to the target map. 
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FIG. 1. Annealing functions for the parameters used in adaptation. See Eqs. (|8|) and (|9|). 

We have tried several alternatives to each of the components of the method outlined above. For a detailed 
description of these, we refer the reader to Vendruscolo et al. (1997). We present here a brief description of 
some selected results that were obtained using the algorithm presented above. 



C. Results 



1. Experimental contact maps. 



In this section we present results about the reconstruction of experimental contact maps as taken from 
PDB. Since our purpose, as explained in the Introduction, is to use the reconstruction in connection with 
dynamics, we chose the contact length i?c = 9 A to obtain the most faithful representation of the energy of 
the protein (Mirny and Domany, 1996). Such a threshold is determined by the requirement that the average 
number of Cq, — Ca contacts for each amino acid is roughly equal to the respective numbers obtained with 
the all-atom definition of contacts. 

The most commonly used dissimilarity measure between structures is the root mean square (RMS) distance 
D (McLachlan, 1979) 



D = 



\ 



1 ^ 

i=l 



where one structure is translated and rotated to get a minimal D. 

The dissimilarity measure between contact maps is defined as the Hamming distance 



Z?--P=^|5,,-5,:^.|, (12) 



which counts the number of mismatches between maps S and S'. 

We present in Fig. ||a the results over 100 reconstruction runs for chains for the protein 6pti (bovine 
pancreatic trypsin inhibitor). 
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FIG. 2. (a) Histogram of the distance D for the 100 runs used to test the reconstruction procedure. Data are 
presented for protein 6pti. (b) Contact maps for protein 6pti for a threshold dt = 9 A. Full squares are the PDB data, 
open squares the output of the reconstruction procedure. None of the target contacts is missed and two spurious ones 
are added (the arrows point at their locations. (Adapted with permission from Vendruscolo et al. (1997). Copyright 
1997 Current Biology.) 

In Fig. ||b we show the contact map for the protein 6pti, N = 56, as taken from PDB, that was used 
as a target to construct a chain. The contact map of a typical reconstructed chain is also shown. In this 
particular case none of the 342 original contacts were missed and only two false positive contacts were added. 
These are close to clusters of correct contacts, indicating slight local differences with the crystallographic 
structure. The distance recorded in this case was D — 1.56. 

We carried out extensive similar tests for other proteins of various lengths (Vendruscolo et ai, 1997). Our 
method produces, using a native contact map as target, a structure whose contact map is in nearly perfect 
agreement with the target. Furthermore, the distance of this reconstructed chain from the native structure 
is quite close to the resolution that can be obtained from the information contained in contact maps. 

2. Non-physical contact maps 

Our main purpose is to develop a strategy to construct a three dimensional structure, starting from a 
given set of contacts, even if these contacts are not physical, i.e. not compatible with any conformation 
allowed by a chain's topology. In such a case we require our procedure to yield a chain whose conformation 
is as "close" as possible to the contact map we started with. The exact measure of such closeness depends 
on the source of non-physicality, as will be demonstrated in two examples described below. 

Our first examples of non-physical contact maps were obtained by randomizing a native contact map; 
this was done by flipping M randomly chosen entries. Contacts between consecutive amino acids (neighbors 
along the chain) were conserved. 

A typical contact map with noise is shown in Fig. ^. The protein is Itrm chain A, whose contact map 
has 1595 contacts, when the threshold is set to 9 A. We show the native map and the target map obtained 
by flipping at random M = 400 entries of the native map, together with the map produced by our method. 
For the particular case shown, the distance to the crystallographic structure D = 2.4 A. 
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FIG. 3. (a) Above diagonal: reference map (open circles) obtained by randomizing the underlying physical map 
(full squares) of protein Itrm chain A. Below diagonal: reconstructed contact map (open square) obtained using 
the noise-corrupted map as target, (b) Average distances (D) versus noise AI for protein Itrm A. (Adapted with 
permission from Vendruscolo et al. (1997). Copyright 1997 Current Biology.) 

The most important conclusion that can be drawn from Fig. ^ is that isolated non-physical contacts are 
identified as such and ignored and the underlying physical contact map is recovered. 

The dependence of this recovery on the noise level is shown in Fig. where we present the average distance 
of the final structure from the uncorrupted Itrm A contact map, for various values of M. Averages were 
taken over 10 different realizations of the noise, and over 10 reconstruction runs for each realization. The 
distance for totally unrelated structures for Itrm A is around 15 A. It is remarkable that up to M < 1000 
a fair resemblance to the experimental structure is still found. Even with the addition of a noise which is 
around 60% of the signal, the reconstruction procedure works. We have found similar result for the smaller 
protein 6pti, which has 342 contacts, and can be fairly well reconstructed with a noise of up to 200 flipped 
contacts. 

To summarize this section, for physically realizable target contact maps our method is very fast and reliable 
to find a chain conformation whose contact map is nearly identical to the target. Moreover, a the method 
is able to find a good candidate structure even when the target map has been corrupted with non-physical 
contacts. The information contained in a known native contact map suffices to reconstruct a conformation, 
which is relatively close to that of the original structure, as was already observed by Havel et al (1979). 
There is, however, an intrinsic limit in the resolution of a contact map. We used a threshold of 9 A between 
Ca atoms to define contact; for this threshold the distance between two typical structures, that are both 
compatible with the contact map, is about 1 A. The threshold of 9 A is relevant for our purpose, of working 
with contact energies in a scheme to derive structure from sequence. 



IV. DYNAMICS IN CONTACT MAP SPACE 



We describe a stochastic method to perform dynamics in contact map space. We explain how the 
motion is restricted to physical regions of the space. 

Our aim is to generate a large number of contact maps that can serve as candidates for the native structure. 
Such maps are necessary for protein folding by means of energy minimization, as well as in order to generate 
decoys needed to test properties of various energy functions. Hence the requirements from any procedure 
that generates such maps are 

• The generated maps should be physical. 
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• The maps should be "protein-hke" ; for example they should have secondary structure elements 

• The maps should have low values of the energy (defined in terms of the sequence and the contact map) 

• Efficiency - in order to generate large numbers of independent maps in reasonable computing time. 



The requirement of physicality is addressed by the method described in the Sec. Ill; whenever a new 
candidate map is generated, we use it as the target map of the reconstruction procedure, and obtain, this 
way, a contact map which corresponds to a physical "chain of beads" . 

In order to move efficiently in contact map space in a way that satisfies the requirements listed above, we 
introduced a four-step procedure which is delined below. For further details we refer to Vendruscolo and 
Domany (1998a). 

1. Non-Local Dynamics: 

Starting from an existing map, we perform large scale "cluster" moves. Clusters are in approximate 
correspondence with secondary structure elements. At this stage, no attempt is made to preserve 
physicality. The contact map which is obtained by this procedure is typically uncorrelated to the 
starting one. 

2. Local Dynamics: 

The resulting map is refined by using local moves of different kinds. Secondary structure formation 
can be viewed as a "growth" process. Starting from a random coil, an a helix is formed by twisting 
one turn at a time (Chakrabartty and Baldwin, 1995). Analogously, an helix can translate locally by 
untwisting a turn at one extreme and reforming it at the opposite end, in a movement reminiscent to 
the reptation dynamics of polymers (de Gennes, 1979). A P sheet is created and removed by zipping 
and unzipping together two /3 strands (Muiioz et al, 1997). We also use the conservative dynamics 
introduced by Mirny and Domany (1996) to further refine the resulting map. 

3. Reconstruction: 

We use the previously introduced reconstruction algorithm (Vendruscolo et al, 1997) to restore phys- 
icality by projecting the map obtained from the second step onto the physical subspace. 

4. Refinement: 

We perform further optimization by energy minimization in real space using standard crankshaft moves 
(Sali et ai, 1994; Vendruscolo and Domany, 1998a). 

The projection procedure from a contact map to its three dimensional counterpart is the bottleneck of the 
method. The dynamic rules that we introduce are aimed at generating uncorrelated starting points for this 
reconstruction. In this way, after each four-step move, we obtain a good candidate map for the native state. 



The contact energy (eq. 15) with some standard parametrization (i.e. choice of the w{ab)) is used in steps 



1, 2 and 4 following the standard Metropolis prescription for the acceptance of a move. 



A. Non-local dynamics 

Rules of non-local dynamics have been introduced in the context of the simulation of equilibrium properties 
of spin systems (Kandel and Domany, 1991) and of surfaces (Evertz et ai, 1991). Under suitable conditions, 
systems with a large number of degrees of freedom arrange themselves in conformations where the degrees 
of freedom are "coherently" grouped together. Using an incoherent dynamic procedure the time it takes 
to go from one coherent conformation to another can be prohibitively long. Physical intuition guides the 
choice of non-local rules to obtain an efficient dynamics. Since in our case we are developing a minimization 
algorithm, we are not concerned with detailed balance, and we can optimize the dynamics by choosing moves 
that minimize the energy. 

We now present the physical considerations that guided our choice of non-local moves. The unknown 
interactions between amino acids dictate the rules that determine the stability of protein folds. Such rules 
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govern the chain topology in a rather stringent way. The overall number of existing protein families is 
estimated to be around 1000 (Chothia, 1992; Orengo et ai, 1994). Protein families are characterized by 
particular arrangements of secondary structures. Secondary structure elements are easily identified also in a 
contact map as clusters of points of characteristic size, shape and location. In the contact map representation, 
secondary structures can be handled very efficiently by binary operations. For example a parallel (3 sheet is 
created by turning from to 1 a set of contacts forming a cluster with the shape of a thin band parallel to 
the main diagonal. To turn a 2-a bundle from an up-down topology to the alternative up-up one, a rotation 
of a cluster of points is required. The operations described provide only a scaffold, which is non physical, and 
must be rectified by the other two steps of the dynamics. The procedure yields every time a completely new 
topology. A MD simulation could obtain the same result only by completely (or at least partially) unfolding 
and refolding the protein. 

In contact maps of experimentally determined protein structures clusters of contacts can be divided in 
four classes. Thick bands of adjacent contacts along the main diagonal represent a helices (see region 1 in 
Fig. Qa). Thin bands represent parallel P sheets if they are parallel to the main diagonal (region 2 in Fig. 
^) and antiparallel sheets if they are antiparallel (region 3 in Fig. |^a). Small clusters or isolated points 
represent structurally relevant contacts between amino acids that are well separated along the sequence. 
These features characterize protein-like contact maps and should be preserved by the dynamics in contact 
map space. 

As a preliminary information we determine the expected number N* of contacts and the number N* of 
clusters that are expected to appear in the contact map. These numbers will be stochastically conserved 
during the dynamics. We have already presented evidence to the effect (Vendruscolo et ai, 1997) that 
N* = aN^ , where N is the length of the protein, ~ 1 and a depends on the threshold that is used to define 
a contact. As for iV*, there are algorithms to predict the secondary structure content, like the PHD (Rost 
and Sander, 1993) or the GOR (Garnier et ai, 1996) algorithms. Alternatively, having a good starting guess 
for the native contact map, one can directly determine N* and TV*. The cluster algorithm is divided into 
three steps: labeHng, deletion and creation. 

1. Labeling: 

Starting from an existing map, the first step is to identify the clusters that are present, which is 
done using the Hoshel-Kopelman algorithm (Stauffer and Aharony, 1992). With our definition of 
contact, contacts Sij with \i — j\ < 2 are always present due to chain connectivity, see Fig. Our 
dynamical rules do not violate these topological constraints. Cluster labeling is made in the upper 
triangle, excluding the first three diagonals. By symmetry, it is sufficient to perform all the dynamics 
in this region. At this stage we calculate the number Nc of contacts and the number Ng of secondary 
structures. Secondary structures are defined as cluster of more than H = 10 points. After this labeling 
procedure, each point in the map has label L{i,j) = (C, K) with a class C and a number K inside 
the class. Five classes are considered. In class a we put the clusters that are formed by bands along 
the main diagonal. In class /3|| we put the clusters that constitute bands parallel to the main diagonal 
but apart from it. In class (3± we group clusters that are in the form of bands perpendicular to the 
main diagonal. In the fourth class we gather all the remaining irregular clusters. In the last class we 
put all the points which do not belong to any cluster (e.g. isolated points). 

2. Destruction: 

N~ existing clusters are deleted from the map. N~' is chosen from a uniform random distribution 
between 1 and Ng. Destruction is simply realized by choosing at random a label (C, K) and by turning 
contacts in the corresponding cluster from 1 to 0. 

3. Creation: 

clusters are created in the map. is drawn from a gaussian distribution of mean N* — {Ng — N^) 
and variance 1. li Ng — > N* then 7V+ = 0. Each time we make M attempts to create a cluster 
(typically M = 100), and we choose the one with the more favorable energy, according to Eq. (|l5|). At 
each creation we first decide with probability P{a), f'(/3||) and P{P±) whether to grow an a, a /Jy or 
a P±- Typically P{a) = P{f3\\) — P{l3±) — 1/3. The length of the created band is a uniform random 
number in [5,30] for a, in [5,12] for and Creation starts by selecting randomly a seed point 



10 



on the map. For a clusters this point is chosen on the principal diagonal, for /3|| at a point displaced 
from the principal diagonal by more than the length of the cluster. No restrictions are imposed on the 
seed of From this point we lay down a cluster in the form of a band as shown in Fig. We do 
not allow secondary structures to overlap or to be closer than 4 spacings on the map, since this is not 
commonly observed in actual contact maps. If while growing the cluster we encounter a point which 
already has a label that violates this condition we restart the creation. The result of a non-local move 
for lubq, starting from the native map shown in Fig. 0a is shown in Fig. Eb. 



B. Local dynamics 

The principal aim of these moves is to allow local rearrangements of the secondary structures that have 
been placed by the non-local dynamics. We first give the dynamical rules to deal with a helices. Consider 
an helix of n amino acids, which we have previously identified as starting from amino acid i and ending 
on z -f n. Typically, n ranges from 5 to 30. To increase the size from the head, we add the two contacts 
(?' + rt + l,2 + n — 2), + n + + n — 3). The tail is increased by adding the two contacts above the diagonal 
(« — 1, i + 2) and (i — 1, z + 3). To decrease the size of the helix, one removes the contacts (i -I- n, i -I- n — 3) 
and (i + n, z + rt — 4) on the head and (z, z -I- 3) and (z, z + 4) on the tail. To translate the helix, one performs 
a reptation-like move in which one turn is removed from one end and added to the other, by using the same 
rules. 

Similar rules govern the growth and the translation of sheets. Consider first an antiparallel /3 sheet formed 
by two strands. The first strand extends from amino acids z to z -I- n and the second from j to j -1- m. 
By unzipping amino acids i -V n and j, we reduce the size at the end closer to the main diagonal. This 
move is realized by setting to (irrespective of their state) the five contacts (z -I- n, j -I- 2), (z -I- n — 1, j -I- 2), 
(z-|-n — 1, (z-l-n — 2, j -I- 1) and (z-|-rz — 2, j). Opening the sheet from the other side is realized by setting 

to the contacts (j-t-2, j + m), (z + 2, j + rn — 1), (z -I- 1, j -|-m — 1), 1, j-|-to — 2) and (i, j -|-to — 2). Zipping 
together the ends is realized by setting the corresponding contacts to one. Translating the sheets, as in the 
case of helices, is realized by opening one end while closing the other. Rules that are entirely similar are 
applied to parallel /3 sheets. In the general case, the sheet might present irregularities which would appear 
as supplementary contacts at the extremities. Since we do not attempt here to realize a physical map we 
implement these simple rules and rely on the reconstruction procedure to take care of the local structural 
details. The resulting contact map after this step is shown in Fig. |^c. We use the conservative dynamics 
introduced by Mirny and Domany (1996) to further refine the resulting map; the result is shown in Fig. 
Typically minor local rearrangements take place. 



C. Reconstruction 



As already observed, a generic contact map is not guaranteed to correspond to any real chain conformation 
in space. This is likely the case of the contact map obtained after the first two steps of the dynamics. By using 
the reconstruction method presented above we project this contact map onto its closest physical counterpart, 
i.e., we create a contact map which is "close" to the starting one, as measured by the Hamming distance, 
and is guaranteed to be physical, i.e., there is a real chain conformation which has that contact map. To 
achieve this result, we construct a backbone conformation in cartesian space and try to force it to have the 
contacts specified in the input contact map. If an input contact map is non-physical, existing contacts are 
discarded and possibly new ones are introduced. Since, however, any difference in the number and locations 
of contacts is penalized, the contact map of the resulting conformation is necessarily close to the starting 
one (Vendruscolo et ai, 1997). Monomers are not allowed to invade each other's space. This is ensured by 
introducing a lower threshold Rl, below which they experience a hard-core repulsion. The lowest Cq-Cq, 
distance found in PDB proteins is around 3.5 A. We chose Rl=5.0 A. With such a value, it is still possible to 
reconstruct all the PDB proteins and the tendency to create too compact structures, typical of the contact 
energy approximation, is minimized. Result of the reconstruction is shown in Fig. E^. 
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D. Refinement: 



We perform further optimization by an energy minimization in real space using a standard Metropolis 
crankshaft technique (Sali et ai, 1994; Vendruscolo and Domany, 1998a). Result of the minimization is 
shown in Fig. In this calculation we used a set W153 of contact energy parameters, that was derived 
using the method presented by Mirny and Domany (1996), applied to the database of 153 proteins listed in 
Vendruscolo et. al (1998) with the present definition of contact. The initial energy of the native fold of lubq 
is 25.72 and the energy of the final map is much lower, -84.20. 
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FIG. 4. (a) Contact map for the native state of protein ubiquitin (lubq). There are 292 non-nearest neighbors 
contacts. Region 1 is an a helix, region 2 a parallel /3 sheet, and region 3 an antiparallel /3 sheet, (b) Contact 
map after a step of the non-local dynamics, (c) After a step of the local growth dynamics. The untwisting of a 
helix is shown in the box. (d) After a step of the local conservative dynamics, (e) After reconstruction, (f) After 
final minimization in real space. In (c-f) squares are the current map and circles the previous one. (Reprinted with 
permission from Vendruscolo and Domany (1998a). Copyright 1998 Current Biology.) 
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V. AN APPROXIMATION FOR THE FREE ENERGY 



First, we introduce the exact free energy of a contact map and discuss two simple approximations 
to it. Second, we present a method to derive energy parameters based on perceptron learning. 



A. The (free) energy associated with a contact map 

As explained above, many microscopic configurations of a protein with sequence A = (oi, a2, a^, ...a^) 
are characterized by the same contact map S. We now show that one can define an exact free energy, 
Ti{ A , S' ) , for the assignment of S to the sequence A. Denote by C a micro-state of the system, specified 
by the coordinates of all atoms of the protein (and of the solvent and any other relevant molecules). The 
true, microscopic energy of this configuration is E{C). In thermal equilibrium each micro-state appears with 
a probability proportional to the corresponding Boltzmann weight e~'i^^^^\ 

The free energy TL[ A , S ) (to which we refer simply energy) associated with sequence A and map S is 
defined as follows: 

Prob(5) cx e-«( ^ ' ^ ) = ^ e"^^(C) A(C, S) (13) 

c 

where 

\ir Q\ — ] ^ ^ consistent with C 
^ ' ' 1 otherwise 

A(C, S) is a "projection operator" that ensures that only those configurations C whose contact map is S 
contribute to the sum (|l3|). In other words, only those micro-states whose contact map is S contribute to 
the sum and hence io TL{ A , S ). 

This definition of the (free) energy of a map is exact; it is nothing but the negative log of the probability 
of observing the map S for sequence A. Therefore 7i( A , S ) has an important property; inasmuch as the 
native fold's contact map, Sq, has the highest probability of appearing the corresponding (free) energy, 
H(So, A) is the lowest among all possible 7i( A , S ), i.e. 

H(A,So) <H(A,S) VS^So (14) 

The main problem with this exact energy is that the sum ( [T^ ) is impossible to carry out. Therefore, one 
takes various phenomenologically motivated guesses for the form of 7i( A , S ), that presumably would have 
been obtained had the sum been carried out. This approach is related in spirit to the phenomenological 
Landau-Ginzburg type free energy used in several areas of condensed matter physics. We also start from the 
simplest approximate form for this complicated function - that of the pairwise contact energy: 

N 

^pa^r^ A , S ) = ^ S,,w{a,, a,) (15) 

That is, if there is a contact between residues i and j, the parameter w{ai, Oj), which represents the energy 
gained by bringing amino acids and Oj in contact is added to the energy. 

Another term the has been introduced (Mirny and Domany, 1996) in the same spirit is a hydrophobic, (or 
solvation) term, of the form 



N 



N 



(16) 
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Here n{a) is the desired number of contacts that amino acid a should have - as obtained from the PDB, for 
example. A hydrophobic residue will have a larger value of n than a hydrophilic one. The term penalizes 
the configuration for deviations of the actual number of contacts of a^, as read from the map S, from the 
desired n{ai). Whereas the terms in TYP""" are specific to the residue with which is in contact, here ai 
does not "care" who are its partners. There may be some "double counting" in using both energy terms, but 
this should be compensated in the way one determines the parameters w{a, b) and /3(a), the latter of which 
measure the strength of these hydrophobic terms. In our approach we determine the parameters by learning, 
with no recourse to a microscopic interpretation, so that we should not worry about double counting. 

B. Optimization 

In the energy function of the form of Eq. (^5|) one has to decide which set of parameters w to use. The 
first idea was to derive a particular set of energy parameters from amino acid pairing frequencies observed 
in available crystallographic structures (Tanaka and Scheraga, 1976; Miyazawa and Jernigan, 1985). 

An alternative method was proposed later by Maiorov and Crippen (1992). Energy parameters are derived 
by requiring that the observed native structures should be the lowest in energy among an ensemble of 
alternative conformations. Given the native contact map So, for each alternative map S they wrote an 
inequality 

HP"*'^(A, So, w) < HP^'^A, S, w) , (17) 

and obtained the set w from the solution of such inequalities. It was also realized that parameter derivation 
can be formulated as an optimization problem. For a fixed set of sequences with their native maps and for 
a fixed set of alternative contact maps, one defines a cost function in parameter space and looks for the set 
of parameters of "minimum cost". Goldstein et al. (1992) maximized the ratio R between the width of the 
distribution of the energy and the average energy difference between the native state and the unfolded ones. 
More recently Mirny and Shakhnovich (1996) expressed the Z score as a function of the energy parameters 
which were then derived by optimization. 

C. Learning the energy parameters by a perceptron 

We set out to determine the energy parameters on the basis of the basic requirement ( |l^ . The basic 
requirement expresses the condition that the energy should attain its lowest value at the true native map. 
It can be stated by posing the following question: 

Is it possible choose energy parameters such that for all the proteins in the database the native 
states have the lowest energy among all possible decoys? 

We choose one protein (or more) with a known native map Sq and generate for it a large set of decoys S^ ; 
then we look for 210 contact parameters for which the inequalities (|I^) hold for all S^. 

The basic attribute of TYP""" that we use is its linearity in the contact energies w{a,b); for any map S^ the 
energy (^5|) is a linear function of the 210 contact energies that can appear and it can be written as 

210 

H^"^'''(A,S^,w) =^iV,(SJu;, (18) 

c=l 

Here the index c = 1, 2, ...210 labels the different contacts that can appear and Nc{S^) is the total number 
of contacts of type c that actually appear in map S^. The difference between the energy of this map and 
the native one is therefore 

210 

AH^-'-^J^u'cxJf =w.x^ (19) 

c=l 
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where we used the notation 



= N,{S^) - N,{So) (20) 

and So is the native map. 

Also the difference in the hydrophobic energy between the decoy and the native map can be cast in a 
linear form. To see this, note that 

20 

Jihydro^^^ S^, V) = ^ M,iS^)v, . (21) 
c=l 

In this case the index c runs over the 20 species of amino acids and 

2 



N 



i=l 



N 

Sik,^ - n{ac) 



S{a,,c), (22) 



where S{ai, c) = 1 if = c and otherwise. 

The difference between the hydrophobic energy of this map and the native one is therefore 

20 

A7^;:^^™ = ^«c2/e^=v.y^ (23) 

c=l 

where we used the notation 

yii = M,{S^) - M,(So) (24) 
In the presence of the hydrophobic term, the inequality ( p^ can then be written as 

^^pair ^ ^^hydro ^ w • + V = U • > , (25) 

where we introduced the following 23G-componcnts vectors 

z = (a;i, . . . ,a;2io,yi, • ■ • ,y2o) ,2Q) 

U = (Wi, . . . , W210,Wl, • ■ • ,W2o) • 

The total energy is linear in the parameters w and v. 
We denote the normalization factor of the vector z^ by 

230 

Z, = Y.{zn\ (27) 

c=l 

1/2 

and from here on we set Z^<— z'^jZp! , so that both z and u are normalized to 1. 

Once the requirement (p^), has been expressed using an energy in the form (|2^), the question whether it 
does or does not have a solution reduces to deciding whether a set of examples is learnable by a perceptron 
(Rosenblatt, 1962). Every candidate contact map provides a pattern for the training session. 

The vector u is "learned" in the course of a training session. The P patterns are presented cyclically; after 
presentation of pattern /i the weights u are updated according to the following learning rule: 

(U + 77Z^)/|U + T^Z^I if U-Zp<0 

(28) 

u otherwise 

This procedure is called learning since when the present u misses the correct "answer" /i^ > for example 
/i, all weights are modified in a manner that reduces the error. No matter what initial guess for the u one 
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takes, a convergence theorem guarantees that if a solution u exists, it will be found in a finite number of 
training steps. Since the parameters v have to be positive, we introduced the following trick. We added 20 
fictitious examples z which are vectors of zeros except the component yi,i = 1, ... ,20 which is set to 1. 

Different choices are possible for the parameter ij. Here we use the learning rule introduced by Nabutovsky 
and Domany (1991) since it allows, at least in principle, to assess whether the problem is learnable. The 
parameter rj is given at each learning step by 

+ 1/d (29) 



1 - h^/d 

where the parameter d (called despair) evolves during learning according to 

d' = , ^ + " . (30) 

Initially one sets d = 1. 

The training session can terminate with only two possible outcomes. Either a solution is found (that is, 
no pattern that violates condition ( p5[ ) is found in a cycle), or unlearnability is detected. The problem is 
unlearnable if the despair parameter d exceeds a critical threshold 

1A//2 



d>dc = ^M[2Z,^a,] ' , (31) 

where M is the number of components of w, and Z,nax is the maximal value of the normalization factors 
(see Eq (p7|)). This value of dc is easily derived for examples of the type of Eq (|2^), using the same method 
as given by Nabutovsky and Domany (1991). 

It is easy to see that u* = (w* , v* ) is a solution of the system ( p5| ) if and only if = (w* , Av* ) is a 
solution of 

1 . ^ 

w • + - V > . (32) 

The outcome of the learning process does not depend on the initial choice of A. However, the learning 
time usually depends on A which has to be chosen conveniently. A useful condition to set A is to obtain on 
average |x| ~ |y/A|. 

VI. RESULTS 

We prove in an extensive number of situations that the pairwise contact approximation both when 
alone and when supplemented with a hydrophobic term is unsuitable for stabilizing proteins' native 
states. 

Whether or not the basic requirement ( p^ ) can be satisfied depends on several factors. We will discuss the 
dependence upon the following issues: 

1. The definition of contact. 

2. The assignment of the contact length Rc. 

3. The number Mp of proteins in the database. 

4. The method used to produce decoys. 
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A. Threading 



We first discuss the dependence of learnability on Mp and Rc using the all atoms definition of contact 
and producing decoys by gapless threading. Once the contact maps of the Mp proteins in the database are 
obtained, decoys are generated for a given sequence of length N from the structures of proteins of lengths 
N' (> N) by selecting submaps of size N x N along the main diagonal of the contact map of the longer 
protein. For each definition of contact that we considered we found two "phases" . There is a region in the 
(i?c, Mp) in which the problem is learnable; that is, there is a set w of pairwise contact energy parameters 
that stabilize simultaneously all the native maps in the set. On the other hand, outside this region (e.g. for 
fixed Rc and large enough Mp) the problem is unlearnable, and no set w exists. 

Without doing any calculation, we can predict a few general features of the {Rc,Mp) phase diagram. 
Having set a definition of contact (e.g. the all atoms one) one can plot the distribution of distances between 
amino acid pairs. Choosing Rc smaller than the smallest observed distance would result in contact maps 
with no contacts, independently of the conformation. No set of energy parameters can then discriminate 
the native map from the decoys. Similarly, choosing Rc larger than the largest observed distance would 
result in contact maps with all the entries set to 1. In this case again no discrimination is possible. Thus we 
expect to find a window of learnability in Rc- It is also reasonable to expect that such a window will shrink 
with increasing Mp. The problem is thus reduced to investigate whether such window remains open for an 
arbitrary large value of Mp, or it closes for Mp large enough. 

The boundaries of the region of learnability must be interpreted in a probabilistic sense. At given Mp and 
Rc, learnability depends on the particular choice of the proteins in the database. In principle one should 
define P{Rc, Mp), the probability for a randomly extracted set to be learnable at {Rc, Mp). The boundary is 
then defined by P{Rc, Mp) — const. In the present study, we chose not to give a precise numerical estimation 
of P{Rc, Mp), which would require many extractions of sets of A'lp proteins from the PDB and would be 
numerically intensive. We are interested in establishing the existence of the boundary rather than in its 
precise determination. Hence we will show that it is possible to find unlearnable datasets when their size is 
large enough. 

The approximate boundaries of the region of learnability are shown in Fig. ^. The window of learnability 
shrinks to zero for Mp above 270. The precise value of the limit of learnability depends on the particular 
choice of proteins, and the fiuctuations in Mp are of the order of few tens of proteins. Such fluctuations are 
larger than expected; they are due to a few proteins that are markedly more difficult to learn than others, 
and their inclusion in the dataset lowers the chances of learnability. For example IvdfA and IgotG are are 
two such "hard to learn" proteins. IvdfA is a chain of 46 amino acids which forms a single a helix; IgotG 
is a chain formed by 2 a helices hinging at 90 degrees. 

Each point shown in Fig. ^ is derived from a few (1 -3) randomly generated sets of Mp proteins in each. 
A full circle indicates that all the sets considered were learnable; open circles signal that at least one set 
was unlearnable. The largest fiuctuations are found for small Mp at the right boundary. For example, for 
different sets with Mp=20 we found that the maximal Rc of learnability varies between 28 A and 31 A. 

In this study, we selected five sets of protein structures from the PDB. The sets ranged from 154 to 945 
proteins. The PDB is an archive of experimentally derived structures of proteins. To date, 8295 coordinate 
entries are deposited. It is known that the information contained in the entire PDB is redundant and some 
is incorrect. Routine methods are available to select subsets of non-homologous proteins whose experimental 
structures are determined reliably (Heringa et ai, 1992; Hobohm and Sander, 1994; Hooft et al. 1996). 

Correlations between solutions at different Mp and Rc ranged from 0.22 (for a solution at Rc — 7.5 A and 
Mp = 200 with a solution at Rc = 4.5 A and Mp = 154) to 0.94, which is the typical correlation between 
two solutions of maximal stability at Rc — 4.5 A and Mp > 150. 

These findings indicate that with the all atoms definition of contact and the physically motivated choice 
of Rc = 4.5 A (Mirny and Domany, 1996) it is possible to stabilize simultaneously, with high probability, 
about 200 randomly selected non homologous proteins against decoys generated by gapless threading. 
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FIG. 5. Region of learnability for the all atoms definition of contact. Several sets of Mp proteins were generated 
for each value of R^. Full circles indicate that all the sets considered for a particular value of {Rc, Mp) were learnable; 
otherwise we use open circles. 



Which decoys are more challenging? It is instructive to consider the overlap Q between the contact map 
of the native state and of the decoys, which is defined as 

1 ^ 

^=max(iV,",Ar.-) J^j'^'^'^'' ^''^ 

where N is the length of protein and and iV^ are the number of contacts in S° and S'' , respectively. We 
considered, for the Cq definition and Rc—8-5 A, two sets of proteins. The first, of 123 proteins, is learnable 
and the second, of 141 proteins, unlearnable. First, for each decoy we calculated, using as initial weights the 
solution of an independent set of 197 proteins, the energy difference ATi, with the native state and the overlap 
Q. In Figs. ||(a) and (b) we present scatter plots of of ATi. on Q for our decoys. The first question we asked 
is whether decoys of high overlap are the more challenging ones. To answer this question we repeated the 
learning procedure by considering only decoys with Q < Qt, where Qt is a threshold value for the overlap. 
The set of 141 proteins was still unlearnable for Qt = 0.6. It seems that the "difficult" decoys are spread 
over the entire range of Q. Including all the decoys in the learning procedure we were able to learn the set 
of 123 proteins, but not the set of 141 proteins. As shown in Fig. ^(c), after learning the set of 123 proteins, 
decoys of low energy are present in the approximate range 0.2 < Q < 0.8. The important finding is that 
the unlearnable case is not qualitatively different (see Fig. ||(d)). Also in this case decoys of low energy 
are present in the range 0.2 < Q < 0.8, although now some of them have AH < 0. The difference is that 
with the set of 123 proteins there are 805,938 decoys, whereas for the set of 141 proteins, 1,071,753. With 
210 energy parameters it is possible to satisfy the smaller set of inequalities but not the larger. Decoys of 
arbitrary Q enter in the learning process and one can argue that the lack of correlation between ATi. and Q 
is a major cause that renders the problem unlearnable for large enough Mp and, further, that an improved 
energy function should first of all provide such a correlation. 
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FIG. 6. Scatter plot of the energy diflerence Ti. between decoys and native state and the overlap Q with the 
native state, (a) Learnable set of 123 proteins for which the energy 7-^?"'"' was calculated using the solution of an 
independent set of 197 proteins; (b) Unlearnable set of 141 proteins with the same initial set of energy parameters; 
(c) The set of 123 proteins with the energy parameters obtained from learning; (d) The set of 141 proteins with the 
energy parameters arrived at when unlearnability has been established. 

To investigate the dependence of learnability on the definition of contact we repeated the same analysis 
for two other definitions of contacts. The first based on and the second on all carbon atoms - a contact 
was assumed if any two C atoms of the two amino acids were closer than Re- In both cases we found regions 
of learnability qualitatively similar to the one shown in Fig. |^ 

To summarize, we showed that, no matter which definition of contact and which Rc are chosen, there is 
always a maximal number (of the order of few hundreds) of proteins that can be stabilized together. Is this 
number large or small? A definite answer is outside the limits of this study. It is small when compared with 
the total number of proteins existing in nature (hundreds of thousands) , but it is fairly large when compared 
with the number of representative proteins in the PDB (also of the order of few hundreds (Hobohm and 
Sander, 1994). We choose to generate decoys by gaplcss threading since it is a very efficient way to obtain 
decoys. To further generalize our conclusion, in the following sections we consider a more refined way to 
obtain decoys, based on energy minimization in the space of contact maps. 
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B. Crambin 



The conclusion drawn from threading is that there is a maximal number of proteins that can be stabihzed 
together using the pairwise contact approximation to the energy. One can ask a more hmited question, 
namely if it is possible to fine tune energy parameters to stabilize just one protein, or possibly a set of 
proteins belonging to the same structural family. Threading offers poor contenders to the native state. 
Better candidates for the ground state are produced by contact map dynamics, the method presented in Sec. 



[V, which explores in an efficient way the space of contact maps. For a particular protein - crambin - we 
show in Fig. |^ that the decoys produced by threading are far less a challenge than those produced by contact 
map dynamics. The set of decoys obtained by threading can be learned - it is possible to produce a set of 
pairwise contact energy parameters that stabilizes crambin in a threading experiment. Here we ask the same 
question for decoys obtained by contact map dynamics (see point 4. at the beginning of this Section). 
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FIG. 7. Histograms that demonstrate the difference in energy between ensembles of contact maps obtained by 
threading and by energy minimization, shown for different contact energy parameters: (VND) as obtained in this 
work, by finding a solution for a threading ensemble; (HL): (Hinds and Levitt, 1994); (MD): (Mirny and Domany, 
1996); (MJ): (Miyazawa and Jernigan, 1996); (MS): (Mirny and Shakhnovich, 1996); (TD): (Thomas and Dill, 1996). 
The energy parameter sets were shifted and rescaled to obtain (w) = and (w^) — (w)^ = 1 (averages are over the 
210 energy parameters). Energies were shifted to set the native state to E—0. 

Crambin (Teeter et ai, 1993) is a protein of length N — 46. We constructed its native map by taking 
the coordinates of the Cq, atoms from the PDB and using a threshold of 8.5 A to define contacts. In 
the crambin chain, 5 out of the 20 amino acids do not appear and 3 appear only once. Thus, among the 
corresponding 210 possible contact energies, only 117 parameters can effectively enter the energy (|l5| ) for 
any set of candidate maps. These parameters form a 117-component vector w. The native map contains 
187 non-nearest neighbor contacts, which involve only 72 of the 117 possible contact energy parameters. We 
summarize here our main result about the question we have addressed in the present work. We will present 
below decisive evidence supporting our main conclusion: 

the problem of fine tuning the pairwise contact energy parameters to stabilize the native state of 
crambin is unsolvable. 
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1. Learning the pairwise contact term 



In an unlearnable case there exist sets of examples for which no solution can be found; for large enough 
P the training set will include, with non-vanishing probability, such an unlearnable subset. 

The condition that has to be met (Nabutovsky and Domany, 1991) to establish unlearnability, is that the 
despair d, recorded during the learning process, should exceed the critical value dc- According to Eq (|3l|), 
for M — 117, we typically get a critical despair dc — lO'^^^. 

In perceptron learning of P > 2M randomly generated examples (which is an unlearnable problem (Cover, 
1965; Gardner, 1988) for large M), d grows exponentially with learning time (Nabutovsky and Domany, 1991). 
Here, due to the specific non-random nature of the learning task, we need much more than P ~ 2M = 234 
examples. Learning is realized in an iterative manner. Starting with an initial set of energy parameters we 
generate a set of examples that are then learned; the new set of energy parameters is used to generate new 
examples and so on. We will refer to each such iteration as an epoch. It was necessary to generate P=298710 
examples in 11 epochs. We underscore the fact that each example was obtained by an energy minimization 



procedure, using the method discussed in Sec. \S . Learning this set of examples is an unsolvable task, 



proved by the divergence of the despair d. We reached d > dc after approximatively 37500 learning cycles. 
To speed up the procedure, without affecting the final result, we selected the 10000 examples at epoch 11 
that had the lowest energy according to the energy parameters obtained from learning at epoch 10. 



2. Energies of the false ground states 



Even though the problem is unlearnable, our procedure produces contact energies that have several ap- 
pealing features. The first is that whereas for the existing contact potentials it is very easy to find maps 
whose energy is below that of the native map, with the w obtained after several training epochs this becomes 
a difficult (albeit possible) task. With the initial energy parameters the vast majority of the contact maps 
that are generated have a lower energy than the native contact map. As can be seen from Fig. ^ where the 
energy scale is shifted so that the native contact map has always zero energy, for increasing epoch index, 
the energy distribution shifts to the right and becomes narrower. Learning is thus accompanied by an im- 
provement of the Z-score, which is a commonly used estimator of the quality of a set of energy parameters 
(Mirny and Shakhnovich, 1996). Hence our learning procedure fiattens the energy landscape, reducing both 
the number of violating examples and their energy difference with the true native state. Another relevant 
question concerns the "location" of these false minima, i.e. how different are the corresponding structures 
from the native one? To study this, we reconstructed the three dimensional conformations corresponding to 
the violating examples and measured their average distance D (see Eq. |ll|) from the native conformation. 
We found that D does not decrease with the epoch index; moreover false minima are found at an approxi- 
mate average D of 8 A at any epoch. Only their number decreased significantly. Two compact uncorrelated 
conformations are typically found at a distance of 15 A, which is also the result we obtained using other 
parameter sets taken from the literature. 
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FIG. 8. Normalized histogram H{Ti.) of the energies of the contact maps at epochs t = 0,1,2 and 10. The energy 
scale is shifted so that the native contact map energy is 0. (Adapted with permission from Vendruscolo and Domany 
(1998b). Copyright 1998 American Institute of Physics.) 



Here, we define the overlap Q between contact maps in a slightly different way as than in Eq. ( |33| ) 



Q 



N, 



(34) 



where Np is the number of contacts present in the native map and in the predicted one and Nc is the total 
number of contacts of the native map (contacts {i, i + 1) and {i, i + 2) are not counted). We generated 1000 
low energy contact maps, using the set of energy parameters obtained at epoch 10. The histogram of the 
fraction of contact maps with a given overlap Q with the native state is shown in Fig. 5. The distribution 
is peaked at around 0.40, which means that typically we are able to correctly recover 40 % of the native 
contacts. We analyzed the distance D of the conformations corresponding to the maps of Fig. ^. We found 
that in the worst cases conformations are found at an approximate average D of 8 A . It is not possible to 
improve this result, since the contact energy parameters cannot be optimized further. 
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FIG. 9. (a) Result of the folding experiment on crambin described in the text in which we generated 1000 low 
energy decoys. Q is the fraction of correctly predicted contacts and Ti is the energy difference between the decoy 
and the native state. (Adapted with permission from Vendruscolo and Domany (1998b). Copyright 1998 American 
Institute of Physics.) (b) Typical low energy map obtained during the simulation. Small dots indicate contacts in 
the native map, open circles contacts in the simulated map. 

Consistently with our conclusion that the pairwise contact energy approximation is unsuitable for protein 
folding, Hao and Scheraga (1996) showed that a more accurate parametrization of the energy is capable of 
improving the performance in a folding simulation. They found that "when the crambin model is simulated 
to fold to the low-energy state with the optimized energy parameters, the folded structure is generally close 
to the target native structure" , although "the energy of the target structure was never able to separate 
completely from the continuous density of the nonnative states through the optimization" . They do find 
nonnative conformations with energy lower than the native state, but the contact maps of these have, on 
the average, high overlap with the native map. 

One should not be misled by similar work on model proteins where, using a pairwise contact energy 
function, it is possible to discriminate the native state, either by optimization (Hao and Scheraga, 1996; 
Mirny and Shakhnovich, 1996; Hao and Scheraga, 1998) or by imposing conditions analogous to Eq ( [l7| ) 
(Van Mourik et ai, 1998). In this case, a database of foldable sequences is designed using a pairwise contact 
potential and subsequently a set of contact energy parameters is recovered. Success in this case is possible 
because the contact energy of Eq (nst) is the exact form of the free energy of the model. 



3. Learning the fiydrophobic term 



We proved that the pairwise contact approximation is unsuitable to stabilize the native contact map of 
crambin against a set of decoys obtained by contact map dynamics. The next step we take is to improve it 
by adding the hydrophobic term (|6|) and ask again if the native state of crambin can be assigned with the 
lowest energy. 

We found that the previously used set of 298710 decoys was learnable with the hydrophobic term. We 
increased the number of decoys to 390117 in 15 epochs and in this way we obtained an unlearnable set. After 
some trial we choose A = 15 in eq. ^ since we found that the despair d grows faster for this choice. 

Introducing the hydrophobic term does not make the problem learnable. However, it improves the situation 
depicted in Fig. ^. We repeated the folding experiment previously done (see Sec. VI B 2). We found that 
the typical value of Q is larger, as shown in Fig. |l0|a. The average value of Q moves from 0.4 (pairwise term 
only) to 0.5 (hydrophobic term). These two values should be compared with 0.2, which is the value obtained 
for decoys produced by threading crambin into a set of 456 proteins (the distribution is marked by a T in 
the figure). 



C. Immunoglobulins 

Crambin could be a peculiar case since it is a short protein, it has only 15 species of amino acids, etc. 
To what extent the results about unlearnability extends to other proteins ? To answer to this question we 
undertook a study of a set of proteins of the family of immunoglobulins. 

An immunoglobulin molecule is formed by two light chains and two heavy ones (Branden and Tooze, 1991) 
held together by disulfide bridges and by the packing of their respective domains. The light chain folds into 
two domains, the variable Vl and the constant Cl and the heavy chain folds into four domain, one variable 
Vh and three constant Chi, Ch2 and Chz- Sequence similarity in constant domains is typically above 35 
% and around 10 to 20 % in variable domains. 

We focused our study on the variable domains of the light chain Vl which is formed by two /3-sheets, one 
of four strands and the other of six strands. The reason to choose immunoglobulins is mainly due to the 
extremely rich amount of information available. The sequence database of Kabat et al. (1991) now contains 
about 5700 different variable domains and the structure of about 140 are available form the PDB. 
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We asked the question if it is possible to stabilize 6 Vl domains simultaneously. The domains chosen were 
8fab A, Ibaf L, Icbv L, Idba L, 2fl9 L and Ifdl L, whose respective lengths were 104, 107, 112, 107, 107 and 
107 amino acids. 



1. Learning the pairwise contact term 

As opposed to the case of crambin, where only 117 pairwise contact species are present, all the 210 possible 
ones are present for immunoglobulins. To prove unlearnability, we adopted the same iterative procedure used 
for crambin. We generated 97309 examples, in 6 epochs. To speed up the procedure we considered only a 
subset of decoys, selected according to the following procedure. Consider the component c of the vector x'' of 
a particular decoy fi. If the numbers Nc{Sf^) and Nc{So) are equal then = and this particular decoy can 
never be used to update Wc- If we initialize Wc = and use only decoys with Xc = then Wc = during the 
entire learning procedure. We excluded all the decoys having contacts of 13 species, in this way we remained 
with 35677 decoys with 197 contact species. Obviously, for such a subset, adding the 13 energy parameters 
cannot change the result, since they never enter in the calculation of the energy. We further selected the 
5000 decoys of lowest energy (according to the energy parameters learned at epoch 5) and performed the 
learning on this subset. In this way the critical value of the despair was surpassed in an amenable computer 
time. 



2. Learning the hydrophobic term 

The set of 97309 decoys was learnable by adding the hydrophobic term. We enlarged it to 110576. Using 
the same trick explained above, we excluded from the learning 13 pairwise contact energy terms. This 
selection procedure identified a subset of 42798 decoys for which we proved unlearnability. 

The failure to assign the lowest energy to the native contact map is more severe in the case of immunoglob- 
ulins. The low energy contact maps in the case of crambin did resemble the native map to some extent, as 
shown in Fig. 0. In the case of immunoglobulins there is very little similarity between low energy maps and 
the native one (see Fig. p^). 

We think that this result deserves an explanation, which we propose in the following argument. As 
already done for crambin, we obtained a set of decoys by threading for immunoglobulins. Such a distribution 
is narrower and shifted to the left with respect to the corresponding distribution for crambin. We assume 
that threading offers random protein-like conformations. Therefore, the shift is originated by the exponential 
increase of the number of physical contact maps (Vendruscolo et ai, 1999) with the length of the proteins. The 
improvement, as measured by the right-shift of the distribution of Q, obtained by learning energy parameters, 
is comparable for crambin and for immunoglobulins. However, since immunoglobulins are longer, one would 
have to work harder to obtain energy parameters suitable for folding - many more decoys are to be weeded 
out. Since we proved that it is impossible to tune better the energy parameters for both approximations 
of the energy that we discussed, the only choice is to develop better forms for the energy. This will be the 
object of future study. 
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FIG. 10. (a) Histogram of the distribution of the overlap Q between maps of low energy (LEM) and the native 
map found in the simulation with the hydrophobic term for immunoglobulins and for crambin. For comparison, the 
distributions obtained by threading are also shown (T). (b) Typical low energy contact map (open circles) generated 
during the simulation of Idba. For comparison the native map is also shown (full dots). 



VII. FINAL CONSIDERATIONS 



We presented the results of our attempt to perform protein folding in the space of contact maps. 

• We demonstrated the feasibility of the original idea of performing energy minimization in contact map 
space. 

• We proved that two simple phenomenological approximations to the free energy can not possibly be 
tuned to assign the lowest energy to the observed native conformation of even one single protein. 

• We leave open the possibility to use better approximations for the free energy and more detailed 
representations of the chain underlying a contact map. 

• We also leave open the possibility of predicting native states by other methodologies that do not use 
energy minimization. 
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